Use cooperative groups to populate Associations (Histograms) in Pixel Patatrack #35713
base: master
Conversation
namespace cms {
  namespace cuda {

    template <template <CountOrFill> typename Func, typename Histo, typename... Args>
This is not used (yet?). It may make the syntax more complex, not simpler.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35713/26023
A new Pull Request was created by @VinInn (Vincenzo Innocente) for master. It involves the following packages:
@jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please review it and eventually sign? Thanks. cms-bot commands are listed here
auto kernel = fillManyFromVectorCoopKernel<Histo, T>;
auto nblocks = (totSize + nthreads - 1) / nthreads;
assert(nblocks > 0);
auto nOnes = view.size();
OK, a huge stack of boilerplate. It could be partially encapsulated in a "launch" interface as in launch.h.
If you want to give it a try, there is launch_cooperative(...) in launch.h. I don't think I've ever tested it, though.
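For reference, a minimal sketch of the raw cooperative-launch boilerplate that such a helper would hide. The wrapper name, the View type and the kernel's argument list are placeholders, not this PR's code; only fillManyFromVectorCoopKernel and the block-count arithmetic come from the diff above.

#include <algorithm>
#include <cuda_runtime.h>

// Minimal sketch, not the PR's actual code: a cooperative launch packs its
// arguments into an array of pointers and goes through
// cudaLaunchCooperativeKernel instead of the usual triple-chevron syntax.
template <typename Histo, typename T, typename View>
void launchFillManyCoop(Histo* histo, View view, int totSize, int nthreads,
                        int maxBlocks, cudaStream_t stream) {
  auto kernel = fillManyFromVectorCoopKernel<Histo, T>;  // kernel added by this PR
  // the grid must not exceed the number of blocks that can be co-resident
  int nblocks = std::min((totSize + nthreads - 1) / nthreads, maxBlocks);
  void* args[] = {&histo, &view, &totSize};  // assumed argument list
  cudaLaunchCooperativeKernel(reinterpret_cast<void const*>(kernel),
                              dim3(nblocks), dim3(nthreads), args, 0 /*shmem*/, stream);
}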
// pass-through fallback for __ldg, so the same code also compiles for the host
template <typename T>
inline T __ldg(T const* x) {
  return *x;
}

namespace cooperative_groups {
From @fwyzard's contribution to patatrack-alone.
#define GET_COOP_RED_FACT_FROM_ENV

// to drive performance assessment by envvar
#ifdef GET_COOP_RED_FACT_FROM_ENV
It makes life easy. Not supposed to be used in production.
#include <cstdlib>

template <typename F>
inline int maxCoopBlocks(F kernel, int nthreads, int shmem, int device, int redFact = 10) {
To be moved to CUDAService? It MUST be called at most once per job (per device? per kernel?).
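For concreteness, one possible body for maxCoopBlocks, sketched on top of the CUDA occupancy API; the PR's actual implementation may differ, and the treatment of redFact here is a guess.

#include <algorithm>
#include <cuda_runtime.h>

// Sketch only: the grid of a cooperative launch must fit into the blocks that
// can be co-resident on the device, so query the occupancy per SM and
// multiply by the number of SMs.
template <typename F>
inline int maxCoopBlocks(F kernel, int nthreads, int shmem, int device, int redFact = 10) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);  // relatively expensive: worth caching
  int blocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, nthreads, shmem);
  int maxBlocks = blocksPerSM * prop.multiProcessorCount;
  // optional "reduction factor" (discussed below); 0 would disable it
  return redFact > 0 ? std::max(1, maxBlocks / redFact) : maxBlocks;
}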
Does CUDA itself require that it be called at most once per job, or would calling it many times be slow (either by itself or because it causes synchronization)?
Either way, a good question. Probably the best place to cache the values would be in CUDAService. Can the number of threads per block and the size of shared memory per block vary between events (in general)?
I checked: major slowdown
This would go well with a more general development I've been thinking about for a while (and that Abdulla may work on, if he comes to CERN in January): making the launch configuration of each kernel configurable, with a common interface.
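Not part of this PR, but to make the idea concrete, a purely illustrative shape for such a common interface (no such CMSSW type exists here):

// Purely illustrative: a per-kernel launch configuration that a framework
// service could fill from the job configuration and cache per device.
struct KernelLaunchConfig {
  int threadsPerBlock = 128;
  int maxBlocks = 0;        // 0 = derive from the occupancy calculator
  int sharedMemBytes = 0;
  bool cooperative = false; // launch via cudaLaunchCooperativeKernel
};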
@@ -183,6 +184,52 @@ namespace cms {
      co[i] += psum[k];
    }
  }

  template <typename T>
  __device__ void coopBlockPrefixScan(T const* ici, T* ico, int32_t size, T* ipsum) {
This is really faster than the above (at least if all required blocks are available)
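For readers unfamiliar with the pattern, a simplified sketch of a grid-wide prefix scan with cooperative groups; the per-block step is serialized on thread 0 for brevity, whereas the real kernel would reuse the existing parallel block scan, and its exact interface may differ.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Simplified sketch, not the PR's kernel: grid-wide inclusive prefix scan.
template <typename T>
__device__ void coopBlockPrefixScanSketch(T const* ci, T* co, int32_t size, T* psum) {
  auto grid = cg::this_grid();
  int32_t chunk = (size + gridDim.x - 1) / gridDim.x;
  int32_t first = blockIdx.x * chunk;
  int32_t len = min(chunk, size - first);  // may be <= 0 for trailing blocks

  // 1) each block scans its own chunk and records its total in psum
  if (threadIdx.x == 0) {
    T sum = 0;
    for (int32_t i = 0; i < len; ++i) {
      sum += ci[first + i];
      co[first + i] = sum;
    }
    psum[blockIdx.x] = sum;
  }
  grid.sync();  // all per-block totals are now visible

  // 2) one thread turns the per-block totals into exclusive offsets
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    T running = 0;
    for (uint32_t b = 0; b < gridDim.x; ++b) {
      T s = psum[b];
      psum[b] = running;
      running += s;
    }
  }
  grid.sync();  // offsets are visible to every block

  // 3) each block shifts its chunk by the sum of all previous blocks
  for (int32_t i = threadIdx.x; i < len; i += blockDim.x)
    co[first + i] += psum[blockIdx.x];
}

The grid.sync() points replace what would otherwise be separate kernel launches, which is presumably where the single-thread speedup quoted in the PR description comes from.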
int maxBlocks = maxCoopBlocks(populate, nThreads, 0, 0, 0);
std::cout << "max number of blocks is " << maxBlocks << std::endl;
auto ncoopblocks = std::min(nBlocks, maxBlocks);
auto a1 = v_d.get();
One cannot take a pointer to the return value of .get(), hence the local variable.
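A tiny self-contained illustration of the constraint (std::unique_ptr stands in for the device unique_ptr used in the test):

#include <memory>

// Illustration only: the launch-argument array stores pointers to the
// arguments, so each argument must be an addressable lvalue; .get() returns
// a temporary whose address cannot be taken.
int main() {
  auto v_d = std::make_unique<int[]>(10);
  auto a1 = v_d.get();   // copy the raw pointer into a named local...
  void* args[] = {&a1};  // ...so its address can go into the args array
  // void* bad[] = {&v_d.get()};  // would not compile: address of an rvalue
  (void)args;
  return 0;
}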
auto env = getenv("COOP_RED_FACT");
int redFactFromEnv = env ? atoi(env) : 0;
if (redFactFromEnv != 0)
  redFact = redFactFromEnv;
This is a "global reduction factor" to reduce the number of blocks required to launch a cooperative kernel.
It may need to be tuned kernel by kernel: a bit of a mess...
I don't understand this: why a "reduction factor"?
If it needs to be tuned kernel by kernel, the effect is the same as setting a hard limit on the number of blocks.
Because I hope there will be no need to tune kernel by kernel to get reasonable performance for any kind of workflow and event size/type.
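To make the trade-off concrete, a worked example with hypothetical numbers (not measurements from this PR):

#include <algorithm>
#include <iostream>

// Hypothetical numbers, for illustration only.
int main() {
  int blocksPerSM = 16;               // resident blocks per SM for a given kernel/config
  int nSM = 80;                       // e.g. a V100-class GPU has 80 SMs
  int maxBlocks = blocksPerSM * nSM;  // 1280 blocks could be co-resident
  int redFact = 10;                   // the "global reduction factor"
  int launched = std::max(1, maxBlocks / redFact);  // only 128 blocks requested
  std::cout << "launch " << launched << " of " << maxBlocks << " possible blocks\n";
  return 0;
}

Requesting fewer blocks than the co-residency limit leaves SM resources for kernels from other concurrent streams, which is the multi-stream throughput concern mentioned in the PR description.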
By the way, on the T4 and V100, what is the maximum number of blocks reported by CUDA?
using View = caConstants::TupleMultiplicity::View;
View view = {tupleMultiplicity_d, nullptr, nullptr, -1, -1};

int blockSize = 128;
Duplicated boilerplate.
The effort of factorizing it may be wasted if one decides to get rid of the TupleMultiplicity container and filter on multiplicity in the fit routine...
@cmsbuild, please test
Milestone for this pull request has been moved to CMSSW_14_1_X. Please open a backport if it should also go in to CMSSW_14_0_X.
ping
Milestone for this pull request has been moved to CMSSW_14_2_X. Please open a backport if it should also go in to CMSSW_14_1_X.
ping (to make bot change milestone)
Milestone for this pull request has been moved to CMSSW_15_0_X. Please open a backport if it should also go in to CMSSW_14_2_X.
In this PR I wish to share code that uses cooperative groups to reduce the number of kernels used to populate "Histograms" (actually OneToMany Associations) in Patatrack.
In unit tests (single-threaded) the gain in speed is noticeable (even in just the prefix scan).
In standard multi-thread, multi-stream workflows a loss in throughput is easily observed if the maximum number of blocks is allocated. Some fine tuning of the number of blocks allocated to each kernel (even just one block?) makes this PR at least as fast as the standard multi-kernel implementation.
More comments inline.
The code is "configured" to run with cooperative groups; of course the actual PR can be merged with the standard multi-kernel implementation as the default.